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Abstract 

Here we report on development of a high order finite element code for the solution of the 
shallow water equations on the massively parallel computer MP-1104. We have compared 
the parallel code to the one available on the Amdahl serial computer. It is suggested that 
one uses a low order finite element to reap the benefit of the massive number of processors 
available. 



1 Introduction 

The shallow water equations are first order nonlinear hyperbolic partial differential equations 
having many applications in Meteorology and oceanography. These equations can be used 
in studies of tides and surface water run-off. They may also be used to study large-scale 
waves in the atmosphere and ocean if terms representing the effects of the Earth’s rotation 
are included. See review article by Neta (1992). 

Indeed, it had become customary, in developing new numerical methods for weather 
prediction or oceanography, to study first the simpler nonlinear shallow water equations, 
which possess the same mixture of slow and fast waves as the more complex baroclinic 
three-dimensional primitive equations. One of the issues associated with the numerical 
solution of the shallow water equations is how to treat the nonlinear advective terms (Cullen 
and Morton, 1980, Navon, 1987). In this paper the two-stage Galerkin method combined 
with a high accuracy compact approximation to the first derivative is used. The method 
was developed by Navon (1987). See also Navon (1979„, 1979j, 1983). Our work here is to 
discuss porting issues of finite element onto a massively parallel machine. Section 2 discusses 
the algorithm, section 3 discusses the MasPar hardware and software. In section 4 we detail 
our numerical experiments and compare the results to the code running on the Amdahl serial 
computer. 



2 Finite Element Solution 

The barotropic nonlinear shallow-water equations on a limited-area domain of a rotating 
earth (using the /?-plane assumption) have the following form: 

u t -f- uu x + vu y + <p x — fv = 0 

v t + uv x + vv y + </?„ + /u = 0 0<x<jL, 0 < y < D, t > 0 

Vt + (<PU) X + (<pv)y = 0 . 

Here u and v are the velocity components in the x and y directions respectively, / is the 
Coriolis parameter approximated by the /? plane as 
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where /?, / 0 , are constants and <p = gh is the geopotential height. Periodic boundary 
conditions are assumed in the x direction and rigid boundary conditions (v = 0) are imposed 
in the y-direction. The domain is a cylindrical channel simulating a latitude belt around 
the earth (see e.g. Hinsman, 1975). The finite element approximation leads to systems of 
ODES which can be finite differenced in time (see e.g. Douglas and Dupont, 1970). In the 
two stage Galerkin (originally proposed by Cullen, 1974), we let any of the 4 derivatives in 
the nonlinear terms be approximated by the compact Numerov scheme, i.e. for 

du 

z ™ = !Tx 

we have 

^0 [ z i+2 + 16 z «+ 1 + 36z,- + 16z;_i 4- 2 ,- 2 ] = 
gj^[ — 5 u ,-_2 — 32u,_i -f 32u, + i -f 5 u, +2 ] 

Similarly for z xv , z yu and z yv .The approximation of requires an interpolation of the bound- 
ary values vo,vx + i 



vo — 4vi — 6 v 2 + 4u 3 — 



V N+1 ~ 4Vat — 6vat-1 + 4uat_ 2 — 



dv 

dy 1 
dv 



dy 



N 



— 25vi -|- 48t>2 — 36 u 3 4- I 6 V 4 — 3t>s 

m 



3u^_4 — 16u/v-_3 -I- 36u^_2 — 48v^_t 4 25v^ 

m 



This stage will require a solution of a pentadiagonal system. For the second stage, we let w 
be any of the four nonlinear terms and we solve a tridiagonal system. For 



w = vz 



we have 



-(to,--! + 4 Wj + Wj+i) = + VjZj _! + 

Vj+ \ z 3 + v i z i + 1 + V i+I z :+1 + 6 VjZj) 



This two stage approximation yields O ( h 8 ) approximation to the derivatives u x ,u y ,v x and 
V 

Now the approximation of the shallow water equations becomes 

JM(uf 1 - uj) + Ai[(« 2t „)' + - />•] = Ai 7 T 21 

M( f " +1 - »") + A([(« v); + Uj*' Mi + /i“" +1 ] = AiTT 31 
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where 



- <$) - '-At A',( V .J + ' + y>J) = 0 

^3. = i(As+' + *;,) 

My = JJ^VjVidA 

*« = xfL v ‘ v *^+xjj A v > v *%h 

K «' - ?//x’i£ k<m 

Kir = £ Jl^V^A 

K » = zJJ A ^ dA 
K » = zJL*T£ v * iA 

and where K are the finite element shape functions. 

u m = u n+1/2 = ^u n - iu"" 1 + 0 (At) 2 ) 



and similarly for v*. 

Schuman (1957) filter was applied every 12 time steps to the v component of velocity in 
order to recover the higher accuracy of the method. 

Since the two-stage Galerkin method does not conserve integral invariants (Cullen [1979]) 
we apply an aposteriori technique using an augmented Lagrangian nonlinearly constrained 
optimization approach for enforcing the conservation of integral invariants of the shallow 
water equations (see Navon and deVilliers (1983) and Navon (1983)). 

3 System Overview 

The MasPar family of massively parallel processing systems consists of arrays of IK to 16K 
processing elements (PE), a scalar control unit (ACU) and a UNIX subsystem. Architec- 
turally, each PE is a custom 64-bit RISC processor with 48 32-bit registers and 64 KB of data 
memory. All PEs execute instructions which are broadcast from the ACU on data stored in 
their local memory. Although there is only a single instruction stream, the processors have 
a number of autonomies, including the ability to generate independent addresses for indirect 
loads and stores to memory. 
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The PEs share data using two communication mechanisms: the xnet and the router. 
The xnet is an eight-way nearest neighbor mesh that is used for structured communications 
such as stencil operations in finite difference codes. The router is a multi-stage circuit- 
switched network for global or random communication patterns. I/O to and from the PEs is 
transferred via the router to an external memory buffer called I/O RAM. From I/O RAM, 
data can asynchronously be transferred to a wide variety of devices such as disk arrays, 
frame buffers, or other machines. The MasPar Disk Array (MPDA) provides up to 22 
GB of formatted capacity as a true UNIX file system. The UNIX subsystem provides the 
programming and run-time environment to users. 

3.1 MasPar Software 

The MasPar system is programmed in either MPL, a parallel extension to ANSI C, or Mas- 
Par Fortran, an implementation of Fortran 90. In MasPar Fortran (MPF) parallel operations 
are expressed with the Fortran 90 (F90) array extensions which treat entire arrays as manip- 
ulatable objects, rather than requiring them to be iterated through one element at a time. 
F90 has also added a significant number of intrinsic libraries; operations such as matrix 
multiplication and dot product are part of the language. Since Fortran 90 is a standard 
defined by the ANSI/ISO committees, programs are architecture independent and can be 
transparently moved to other platforms. 

F ortranll Fortran 90 

doi = 1,256 a = t + c 

doj = 1,256 
a(i,j) = b(i,j) + c(t, j) 

enddo 

enddo 

The Fortran 90 code can be run on any computer with a F90 compiler. On a scalar machine 
such as a workstation, the arrays will be added one element at a time; just as if it had been 
written in Fortran 77. On a vector machine, the number of elements added at a time is 
based on the vector length; a machine w r ith a vector length of 64 will add 64 array elements 
at once. The MasPar machine acts like a vector machine with a very long vector. On a 16K 
MasPar machine, 16384 arrays elements are added simultaneously. 

MasPar provides key routines in math, signal, image, and data display libraries. The 
Math Library (MPML) contains a number of high-level linear algebra solvers, including 
a general dense solver with partial pivoting, a Cholesky solver, a conjugate solver with 
preconditioning, and an out-of-core solver. MPML also includes a set of highly-tuned linear 
algebra building blocks, analogous to BLAS on vector machines, from which the user can 
develop additional solvers. The Data Display Library provides a convenient interface to 
graphically display data from within a program as it is executing. 

The MasPar Programming Environment (MPPE) is an integrated, graphical environment 
for developing, debugging, and tuning applications. MPPE provides a rich set of graphical 
tools that allow the user to interactively control and visualize a program’s behavior. The 
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statement level profiler allows the user to quickly identify the compute-intensive sections of 
the program while the machine visualizer details the use of hardware resources. Each of 
these tools are continuously available without having to recompile, even if a program has 
been compiled with optimizations. 



4 Program 

The program is modular and is complemented with easily reachable switches controlling 
print and plot options. The Input to the program consists of a single line containing the 
following six parameters: 

DT - the time step in seconds (F5.2) 

NLIMIT - total number of time steps (15) 

MF - number of time steps between printing solution (15) 

NOUTU - to print (1) or not to print (0) the u-component 

NOUTV - to print (1) or not to print (0) the v-component 

NPRINT - to print (1) or not to print (0) the global nodal numbers of each triangular 
elements and the indices and node coordinates of the nonzero entries of the global matrix. 

The main program initializes all variables and then reads the only data card of the 
program. It then proceeds to index and label the nodes and the elements, thus setting up 
the integration domain. This is done by subroutine NUMBER. 

Subroutine CORRES determine the nonzero locations in the global matrix and stores 
them in array LOCAT. The initial fields of height and velocity are set up by subroutine 
INCOND. The derivatives of the shape functions (Vj) are calculated in AREA A. A compact 
storage scheme for the banded and sparse global matrices is implemented in subroutine 
ASSEM. The method is based on the fact that the maximum number of triangles supporting 
any node is six. Three different types of element matrices (3 x 3) will be required for assembly 
in the global matrices. 

A switch, denoted NSWITCH is set for selecting between the different types of element 
matrices. After setting up the time independent global matrices the program proceeds to 
the main do-loop which performs the time-integration and which is executed once for every 
new time-step. 

As the solution of the nonlinear constrained optimization problem of enforcing conser- 
vation of the nonlinear integral invariants requires scaling of the variables, the scaling is 
performed in the main program as well as in subroutine INCOND. 

In the main integration loop the simulation time is set up and adjusted and then the 
subroutines ASSEM and MAMULT set up and assemble the global matrices which then are 
added up in a matrix equation, first for the continuity equation and in a similar manner for 
the u and u-momentum equations. 

Subroutine SOLVER then is called to solve the resulting system of linear equations (of 
block tridiagonal form) by the conjugate gradient square. 

The new field values for the geopotential and velocities, , v ) T j +1 respectively, are 

used immediately as obtained in solving the coupled shallow-water equations system. For 
the u and u-momentum equations, the new two-stage Numerov-Galerkin scheme is imple- 
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mented. Separate routines are set up for the x and y-derivatives advection terms, DX and 
DY respectively. Subroutine DX implements the two-stage Numerov-Galerkin algorithm de- 
scribed previously for the advective terms in the u and r-momentum equations involving the 
i-derivative. 

In the first stage it calculates the 0(h 8 ) accurate generalized-spline approximation to 
the ( du/dx ) first derivative by calling upon subroutine CYCPNT which solves a periodic 
pentadiagonal system of linear equations generated by the spline approximation. 

In the second stage it implements the second part of the Numerov-Galerkin algorithm 
for the nonlinear advective term u(du/dx) and solves a cyclic tridiagonal system by calling 
upon subroutine CYCTRD. Subroutine DY implements the two-stage Numerov-Galerkin 
algorithm described previously for the advective terms in the u and r-momentum equations 
involving the y— derivative. In its first stage it calculates the 0(h 6 ) accurate generalized- 
spline approximation to the ( du/dy ) first derivative by calling upon subroutine PENTDG 
which solves the usual pentadiagonal system of linear equations generated by the generalized- 
spline approximation. 

In the second stage subroutine DY implements the second part of the Numerov-Galerkin 
algorithm for the nonlinear advective term u(du/dy) and solves the Galerkin product by 
calling upon subroutine NCTRD to solve a special tridiagonal system. 

The boundary conditions are implemented by subroutine BOUND. Periodically, a Schu- 
man filtering procedure is implemented for the v-component of velocity only, by calling 
subroutine SMOOTH. The integral invariants are calculated at each time-step by calling 
subroutine LOOK. If the variations in the integral invariants exceed the allowable limits 
or &Z-, the Augmented-Lagrangian nonlinear constrained optimization procedure is 
activated. The unconstrained optimization uses the conjugate-gradient subroutine E14DBF 
of the NAG(1982) scientific library. Subroutine E14DBF calls a user-supplied subroutine 
FUNCT which evaluates the function value and its gradient vector as well as subroutine 
MONIT whose purpose is merely to print out different minimization parameters. 

After a predetermined number of steps, subroutine OUT is called, which in turn calls 
upon the subroutines LOOK to calculate the integral invariants. Practically 4-5 augmented- 
Lagrangian minimization cycles were determined to be sufficient. 

We ran the program under MPPE and the following table shows the CPU time used by 
some of the routines. All others require less than 5% each. Therefore we have decided to par- 
allelize ASSEM, MAMULT, SOLVER (switching from Gauss Seidel to Conjugate Gradient 
Square). Other subroutines we parallelized are: 

CORRES, INCOND, LOOK, MONIT, NUMBER and AREAA. 

After this, the most time consuming routines become E14DBF and FUNCT. These are 
required only if the integral constraints are not conserved. Therefore if the mesh is fine, 
these routines will not be called. Our numerical experiments confirmed that these two 
routines were called only in the coarsest grid case. 

The next set include: DX, DY, CYCTRD, CYCPNT, NCTRD, PENTDG, TRIDG, and 
SMOOTH. We have decided not to try at this point to parallelize these or BOUND. We 
have ran this program on the MP-1104 (4096 processors) on a variety of grid sizes. The 
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Routines 


CPU 


SOLVER 


32% 


ASSEM 


25% 


MAMULT 


14% 


CORRES 


5% 


BOUND 


5% 



Table 1: CPU time used by some routines 

original program was also ran on the Amdahl 5990/500 serial computer. All computations 
were performed in double precision. The domain is a rectangle 6000 km by 4400 km. The 
coarsest mesh, Ax = Ay = 400 km. This means that the number of grid points in the 
x*direction, NC, is 15 and the number of grid points in the y-direction, NROW is 11. (At 
will be adjusted for stability.) The number of time steps, NLIMIT, is 30. 



NC 


NROW 


Ax(km) 


Ay (km) 


Af(sec) 


Amdahl (sec) 


MP-1104(sec) 


15 


11 


400 


400 


18. 


1.14 


14 


48 


45 


1331 


1331 


5.51 


13.52 


31.3 


63 


62 


93.75 


70.97 


4.22 


24.8 


44.3 


88 


85 


51.76 


51.76 


3.03 


48.32 


80 


128 


125 


46.87 


46.87 


2.10 





164 



Table 2: Total CPU time for several grids 
The initial condition for the height field is given by 



h(x , y) = Ho + H , tanh 9(£> ^ ^ 4- 



Hi 



cosh 2 



sin 



27TX 

~L 



where 

H 0 = 2000m, ^ = -220 m, H 2 = 133m, 
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and 



fo = 10 *sec 1 , (3 = 1.5 x 10 11 sec 1 m V 

This initial condition is given in Grammeltveldt (1969) and tested by several researchers 
(Cullen and Morton (1980), Gustafsson (1971), Navon (1987) etc.) The initial velocity fields 
were derived from the initial height field via the geostrophic relationships 

9 dh 

U = ~Wy 



- i — 

f dx 



Table 2 gives the CPU time for each grid. 

If we compare the CPU time for three of the subroutines we parallelized (to avoid the 
difficulty that some parts are still running on the front end) we find that in MAMULT and 
SOLVER we were able to cut the CPU time. The results are summarized in Table 3. 



Subroutine 


Problem size 


Amdahl (sec) 


MP-1104 (sec) 


ASSEM 


48 by 45 


3.02 


5.77 




63 by 62 


5.47 


8.56 




88 by 85 


10.49 


15.2 




128 by 125 


- 


34.4 


MAMULT 


48 by 45 


.42 


.44 




63 by 62 


.74 


.37 




88 by 85 


1.44 


.88 




128 by 125 


- 


1.53 


SOLVER 


48 by 45 


7.21 


5.97 




63 by 62 


13.14 


4.87 




88 by 85 


25.38 


10.6 




128 by 125 


- 


17.9 



Table 3: CPU time before and after parallelization 

The code was ran under profiler and we found that now the CPU usage (in percent of 
total CPU) is as given in table 4. 

It is clear that one should parallelize DX,DY,PENTDG,TRIDG and LOOK. The first 
four require that one parallelizes the subroutines NCTRD,CYCTRD and CYCPNT. This is 
not done since the tridiagonal and pentadiagonal systems to be solved are of order NC. We 
feel that one should approach this problem slightly differently. Instead of trying to parallelize 
this code which is of high order, we should parallelize a low order finite element code for the 
shallow water equations. The accuracy of the solution will be obtained by using an even finer 
mesh than 46 km (NC=128) we used above. It will be interesting to compare the accuracy 
and efficiency of the two codes on MP-1104 machine. 
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Subroutine 

FUNCT 

DX 

DY 

ASSEM 

PENTDG 

MAMULT 

TRIDG 

LOOK 

NCTRD 

CYCPNT 

CYCTRD 



SOLVER 8.0 

SET STI 1.0 

BOUND 1.8 

VFEUDX 1.8 

rest 2.8 



88 by 85 


128 by 125 


17.0 


18.6 


16.6 


20.0 


16.0 


14.3 


13.7 


11.4 


9.8 


6.9 


6.9 


5.2 


4.4 


8.4 


3.3 


2.5 


3.2 


2.4 


2.1 


1.5 


1.9 


1.1 


1.4 


1.2 


1.0 


3.9 


.6 


.5 


2.1 


2.1 



15 by 11 44 by 45 

36.8 

3.2 12.3 

3.2 12.8 

10.2 17.9 

2.5 12.0 

16.2 13.7 

1.2 6.5 

9.1 4.1 

.7 3.2 

.7 3.9 

.8 2.6 

4.0 
1.7 
1.7 

1.3 



Table 4: CPU time by subroutine after parallelization 



Conclusion 

We have developed a high order finite element code to solve the shallow water equations 
on the MasPar massively parallel computer MP-1104. It is believed that a low order finite 
element code will be more efficient on the MP-1104 computer. 
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SUBROUTINE ASSEM (COMA, STI ,NSWTCH,CODI .AREA, tnod, tlocat) 
INCLUDE ’PAR* 

include 'inter.info’ 

real COMA (7 , NDDEB) , STI (NNOD ,NN0D , NELE) 

real CODI (NNOD, NELE) 

integer itemp, tl, t2, t3, t4, tO, timel, time2 
real tgmat (7, NELE) 

integer tlocat (6 .NODE) , tnod (NNOD, NELE) , nsvtch 

CMPF ONDPU STI, CODI, COMA 
cmpf map coma (memory, allbits) 
cmpf map codi (memory , allbits) 
cmpf map tgmat (memory, allbits) 
cmpf map tlocat (memory, allbits) 
cmpf map tnod (memory, allbits) 
cmpf map sti (memory, memory, allbits) 

COMA * 0.0 
C 

C DECIDE WHICH ELEMNT MATRIX MUST BE CALCULATED 

GOTO (100,200,500,600), NSWTCH 
C 

100 continue 

call assem_sl(codi , tgmat, tlocat, tnod) 
coma(:,:) = tgmat (:, :N0DEB) 

RETURN 

C 

200 continue 

vrite(6,*) ’ error for nsvitch =2’ 

RETURN 

C 

500 continue 

call assem_s2(area, tgmat, tlocat, tnod) 
coma(:,:) « tgmat (:, :N0DEB) 

RETURN 

C 

600 continue 

call assem_s3(sti , tgmat, tlocat, tnod) 
coma(:,:) = tgmat (:, :N0DEB) 

return 

END 



11 



subroutine assem.sl (codi , gmat, locat , nod) 
include 'PAR’ 

real, intent (in) :: codi(NN0D, NELE) 
real, intent (out) :: gmat (7, NELE) 

integer, intent(in) :: locat (6, NODE) , nod (NNOD, NELE) 
real t cod i (NNOD, NELE) 
cmpf map codi (memory, allbits) 
cmpf map gmat (memory , allbits) 
cmpf map locat (memory , allbits) 
cmpf map nod (memory , allbits) 
cmpf map tcodi (memory , allbits) 



integer irow(NELE), icol(NELE), i, k, j, 1 

gmat = 0 . 
tcodi = codi/6. 

do 100 k * 1, NNOD 
irow = nod(k, : ) 

do 150 j - 1, NNOD 



icol = nod(j , : ) 
if( k .eq. j) then 



cmpf collisions 



gmat (7, irov)=gmat(7,irow)+tcodi(j , : ) 



goto 150 

endif 

do 200 1 = 1, 6 



where (locat (1, irow) .eq. icol) 



cmpf collisions 



gmat (1, irow) = gmat (1, irow) + tcodi (j,:) 
end where 



200 

150 

100 



continue 

continue 



continue 



return 

end 
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subroutine assem_s2(area, gmat , locat, nod) 
include 'PAR* 

real area, tareal2, tarea6 

integer locat (6, NODE) , nod (NNOD, NELE) 

real gmat (7,NELE) 

cmpf map gmat (memory , allbits) 
cmpf map locat (memory, allbits) 
cmpf map nod(memory, allbits) 

integer irow(NELE), icol(NELE), i, k, j 

gmat = 0 . 

tareal2 = area/12. 
tarea6 « tareal2 * 2. 



do 100 k = 1, NNOD 
irov = nod(k, : ) 

do 150 j - 1, NNOD 
icol = nod(j , :) 

if C k .eq. j) then 

cmpf collisions 

gmat (7 , irov) =gmat (7 , irov)+tarea6 
goto 150 
endif 

do 200 1 * 1, 6 

where (locat (1, irov) .eq. icol) 



cmpf collisions 



gmat (1, irov) = gmat (1, irov) + tareal2 
end where 



200 

150 

100 



continue 

continue 

continue 



return 

end 
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subroutine assem_s3(sti , gmat, locat, nod) 
include 'PAR' 

real, intent(out) :: gmat (7,NELE) 

real, intent(in) :: sti(NNOD,NNOD,NELE) 
integer, intent(in) :: locat (6 , NODE) , nod(NNOD.NELE) 
cmpf map gmat (memory, allbits) 
cmpf map locat (memory , allbits) 
cmpf map nod(memory , allbits) 
cmpf map sti (memory, memory, allbits) 

integer irow(NELE), icol(NELE), i, k, j, 1 



gmat = 0 . 

do 100 k = 1, NNOD 
irov = nod(k, :) 

do 150 j = 1, NNOD 
icol = nod(j , : ) 
if( k .eq. j) then 

cmpf collisions 

gmat(7,irov)=gmat(7,irov)+sti(k,j , :) 
goto 150 
endif 



cmpf collisions 



200 

150 

100 



do 200 1 = 1, 6 

where(locat(l ,irow) .eq. icol) 



gmat(l,irov) = gmat(l,irow) + sti(k,j,:) 
end where 
continue 
continue 
continue 



return 

end 
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SUBROUTINE MAMULT (COMA, VECTOR, RIGHT, locat) 

INCLUDE ’PAR' 

real, intent(in) :: C0MA(7,N0DEB) , vector(:) 

real, intent (out) :: RIGHT(:) 
integer, intent (in) :: locat(6,N0DE) 
integer nloc(NODE) 
integer kr 

cmpf map coma(memory, allbits) 
cmpf map locat (memory , allbits) 

right = 0. 

RIGHT(tNODE) = coma(7, :node)*vector( mode) 

DO 80 KR=1 ,6 

nloc( : )=locat (kr , : ) 
where (nloc ( : ) .ne .0) 

& RIGHT( mode) = RIGHT(mode) + C0MA(KR, :node)*VECT0R(nloc( : ) ) 

80 CONTINUE 
RETURN 

end 

! Conjugate Gradient Square (CGS) method to solve non-symmetric 
! positive definite metrix. Ax = b 

i 

! coma = input matrix A 
! right * b 
! xsolv = x 

subroutine my_solver (coma, right , xsolv, eps , itermx, locat) 
include 'PAR* 
include ’mamult.if’ 

real coma(7 ,N0DEB) , eps 

real right (NODEB) , xsolv(NODEB) 

real, dimension(NODE) :: r, rbar, p, pi, q, u, mv 

real*8, dimension(NODE) :: brbar, bpl 

real beta, convl, conv2, restl, rest2 

real delO, dell, alpha, residual(lOO) 

real*8 bdelO, bdell 

integer locat (6, NODE) 

integer itermx, i, j 

common/debug/ nt ime 
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cmpf map coma(memory , allbits) 
cmpf map locat (memory, allbits) 
r = right(:node) 
convl = dotproduct(r,r) 
call mamult(coma, xsolv, mv, locat) 
r = r - mv 
rbar = r 
p * r 
u = r 

itermx * 100 

pi = 0. 

eps = 1.5e-6 

do 10 i = 1, itermx 

call mamult (coma, p, pi, locat) 
delO = dotproduct (rbar ,pl) 
dell = dotproduct (rbar, r) 
alpha = dell/delO 
q = u - alpha*pl 
u = u + q 

xsolv( :node) = xsolv(:node) + alpha*u 
call mamult (coma, u, pi, locat) 
delO = dell 
r = r - alpha*pl 
conv2 * dotproduct(r.r) 
residual (i) = sqrt(conv2/convl) 
if( residual (i) .It. eps) then 
return 

endif 

dell = dotproduct (rbar, r) 
beta = dell/delO 
u = r + beta*q 
p = u + beta * (q + beta*p) 

10 continue 

PRINT 2001 

2001 FORMAT (1X,’N0 CONVERGENCE') 

do 20 i s 1, itermx 

write(6, *) i, ' residual is ', residual(i) 

20 continue 

stop 
END 
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